Abstract

We report the design and implementation of a call-graph
profiler for GNU Octave, a numerical computing platform.
GNU Octave simplifies matrix computation for use in
modeling and simulation. Our work provides a call-graph
profiler, which is an improvement on the flat profiler.
We elaborate on the design constraints of building a
profiler for numerical computation, and benchmark the
profiler by comparing it to rudimentary timer start-stop
(tic-toc) measurements for a similar set of programs. The
profiler code provides clean interfaces to the internals of
GNU Octave, for use by other (newer) profiling tools on
GNU Octave.



I. Introduction 

GNU Octave is a numerical computing platform based
on programming with matrices as a fundamental construct.
Much modeling and simulation research benefits from the
reduced time and complexity offered by matrix programming
environments. When modeling and simulation applications
are developed through exploratory analysis, or initial
speculative analysis of parameters, optimization is not
the key goal. However, once a stable model is created,
running a parametric search over a large search space
requires the creation of optimized models.

Optimizing computational models written in matrix
languages, for time and space usage, helps to directly
speed up the search process by reducing time and resources
(memory, disk space). There are many ways of analyzing
program performance for optimization, like tracing,
logging, and profiling. Profiling is the only method where
the runtime is proportional to program size, unlike the
other methods, which depend on the execution time of the
program.

Clearly, for large scale parametric searches, we cannot
use tracing or logging for optimization, due to the
enormous log data that needs post-analysis. This makes
profiling an efficient tool for program optimization.
Profiling for matrix programming languages like GNU Octave
helps identify resource usage, inefficient function usage,
program flow, and piecemeal function run-times, and gives a
complete idea of the running times spent on subroutines as
a percentage of the program lifetime.

The landmark work in profiling is the GNU profiler
(gprof) [1], which initiated the idea of a Call-Graph
profiler. Since gprof, many types of profilers have been
proposed and built, including dynamic instrumented
profilers, static sampling profilers, flat profilers, and
call graph profilers.

The Java Virtual Machine (JVM) provides a complete
infrastructure for dynamic program analysis, and encourages
custom built profilers for querying and collecting
statistics from the JVM [2]. This is called the Java Virtual
Machine Profiler Interface (JVMPI), and represents the
state of the art in profiling. Using JVMPI, many successful
profilers can be built for generic profiling or integrated
into existing applications.

We present a call-graph profiler for matrix based
languages, implemented on the freely available (Open Source)
GNU Octave platform. With knowledge of Octave internals [6],
we chose the GNU Octave platform for creating the profiler.
Our work addresses concerns of optimization of numerical
computational models, discrete simulations, and exploratory
analysis, through the use of profiling. Important metrics
for the profiler are minimal sampling time, minimal memory
and resource usage by the profiler, dynamic data collection,
and meaningful output presentation.



We build a dynamic instrumentation profiler, with dispatch
built into the Octave interpreter, which allows us to create
profilers of increasing complexity, from the Flat profiler
up to the Call-Graph profiler.

The terminology used to describe the profiler statistics
is as follows:

1) Self time: the time taken for the subroutine to run,
excluding the runtime of the subroutines it calls. This
includes only the computational time, and not the
time for functions called during computation.

2) Total time: the run-time of a subroutine including its
calls to sub-functions. This is greater than or
equal to the self time of the subroutine.

3) Average time: the average of the total times, over all
calls to the subroutine. The self time and total time
are reported as averages over the total calls.

4) Percentage time: the fraction of the program runtime
for which the given subroutine has been executing.

5) Number of calls: the total number of times a subroutine
was called.

6) Cumulative time: the sum of the times of all
subroutines that have lesser self time than the
current function. Results are sorted in descending
order of self time, and increasing cumulative time.

In this paper, we use the terms function and subroutine 
interchangeably. 

The rest of the paper is organized as follows. In Sec
II, we report the profiling API required to create profilers
on GNU Octave. The algorithms used for the Flat profiler
and the Call-Graph profiler are discussed in Sec III and IV.
Benchmarking results of the profilers are presented in Sec
V, to illustrate the accuracy of, and confidence in, the
profiled results. Finally, in Sec VI, we review the features
and limitations of our profiler framework for Octave, and
indicate the future work required to integrate the profiler
into the main GNU Octave project.

II. Octave Profiler API 

Most profiling systems work by collecting data about a
program at runtime. Profilers collect statistics, and the
time of occurrence for each piece of information. To enable
collection of profiling events like function invocation,
object creation, deletion, function call completion, and
similar interpreter-specific data, profilers need an
event delivery mechanism from the interpreter itself. This
splits building a profiler for the GNU Octave language into
two tasks: building an event delivery mechanism from the
interpreter API to the profiler, and then building profilers
that make use of the events. This separation of concerns was
inspired by the profiler design of languages like Python [3]
and Ruby [4]. The rest of this section describes the events
of interest to our profiler, and the API for the event
delivery system.

The simplest profiler (the Flat profiler, Sec III) needs
information about the occurrence of two events: the function
call and the function return. Since Octave is not an object
oriented system, we do not have object creation, deletion,
or member function invocation, and such events cannot
be reported. In a more complicated profiler (the Call-Graph
profiler, Sec IV), the same events (function call, return)
have attributes like the function-caller and function-callee
passed to the profiler program. We do not trap events
like a variable's state changing, entering a program section
corresponding to a line number in the source code, or
leaving such a program section. The implications of these
events are discussed in Sec VI.

The API design of the event-reporting system is based
on a global singleton class octave_profiler, which is
instantiated by the Octave interpreter. The class
octave_profiler has a member variable profiler_fcn, which
serves as an event reporter function. This is set by the
profiler program, using a call to
static bool set_profiler(profiler_function profile);.
When set by a profiler, profiler_fcn is invoked
from within the Octave interpreter whenever execution
invokes a function or returns from a function. The profiler
stops profiling by invoking static bool clear_profiler();,
which resets the handle profiler_fcn to null and prevents
event delivery.

The event delivery is managed by the function
static void send_event(const octave_function fcn,
profiler_function_type ftype, profiler_call_state cstate);
from the internals of the Octave interpreter. In this
invocation fcn is the function in question, and cstate
indicates whether execution has entered or returned from a
function.

This send_event function abstracts the call sequence
profiler_fcn(fcn, ftype, cstate); by keeping it private.

This API in short provides a hook for the profilers to
start/stop receiving events, and a base class that
implements this mechanism. Function profilers are written by
deriving from the primitive base class profile_base. The
base class provides timing information, and the setup and
tear-down for the events as discussed above.
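As an illustration of the hook described above, a minimal C++ sketch might look as follows. Only the names octave_profiler, profiler_fcn, set_profiler, clear_profiler and send_event come from the text. The stand-in types, the enumerator names, the bool return semantics and the log_event callback are assumptions made for this sketch, not the Octave sources.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for the interpreter types named in the text.
struct octave_function { std::string name; };
enum profiler_function_type { USER_FUNCTION, BUILTIN_FUNCTION };
enum profiler_call_state { CALL_ENTER, CALL_RETURN };

// Signature of the event-reporter callback installed by a profiler.
typedef void (*profiler_function) (const octave_function *fcn,
                                   profiler_function_type ftype,
                                   profiler_call_state cstate);

class octave_profiler
{
public:
  // Install a callback; refuse if one is already active (assumed policy).
  static bool set_profiler (profiler_function profile)
  {
    if (profiler_fcn != nullptr || profile == nullptr)
      return false;
    profiler_fcn = profile;
    return true;
  }

  // Reset the handle to null so that no further events are delivered.
  static bool clear_profiler ()
  {
    bool was_set = (profiler_fcn != nullptr);
    profiler_fcn = nullptr;
    return was_set;
  }

  // Called by the interpreter on every function entry and return.
  static void send_event (const octave_function *fcn,
                          profiler_function_type ftype,
                          profiler_call_state cstate)
  {
    assert (fcn != nullptr);
    if (profiler_fcn != nullptr)
      profiler_fcn (fcn, ftype, cstate);
  }

private:
  static profiler_function profiler_fcn;  // kept private, as in the text
};

profiler_function octave_profiler::profiler_fcn = nullptr;

// Example callback: logs "+name" on entry and "-name" on return.
static std::vector<std::string> event_log;

static void log_event (const octave_function *fcn, profiler_function_type,
                       profiler_call_state cstate)
{
  event_log.push_back ((cstate == CALL_ENTER ? "+" : "-") + fcn->name);
}
```

Keeping profiler_fcn private means that the interpreter only ever talks to send_event, so a profiler can be attached or detached without touching the call sites inside the interpreter.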

The user level profiler interface is through the function
profile, which has the usage: profile [on|off|info] [graph|flat],
where the options are indicated in square brackets. A
typical use case would be:

1) profile on graph %use a call-graph profiler

2) bch() %invoke the script

3) profile info %stop profiling & print out the results.

The profile function is responsible for creating instances
of the Call-Graph or Flat profilers, and passes the start,
stop, and print requests on to these profilers using the API
of the profiler:

1) static void start_profiling()

2) static void stop_profiling()

3) void print_profile()

The profilers extend the base class profile_base. This
completes the description of the Octave side profiling API,
and the user interface for profilers.
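The relation between the profile command and the profiler classes can be sketched in C++ as below. This is a simplified illustration under stated assumptions: the profile_command dispatcher is hypothetical, start and stop are ordinary members rather than the static members listed above, and print_profile returns a string instead of printing so that the behaviour is easy to check.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Simplified stand-in for the base class. The text declares start/stop as
// static members; this sketch uses ordinary members (an assumption).
class profile_base
{
public:
  virtual ~profile_base () {}
  void start_profiling () { active = true; }
  void stop_profiling () { active = false; }
  bool is_active () const { return active; }
  // Returns the report as text instead of printing it (an assumption).
  virtual std::string print_profile () const = 0;

private:
  bool active = false;
};

class profiler_flat : public profile_base
{
public:
  std::string print_profile () const override { return "flat report"; }
};

class profile_callgraph : public profile_base
{
public:
  std::string print_profile () const override { return "call-graph report"; }
};

// Hypothetical dispatcher behind `profile [on|off|info] [graph|flat]`.
class profile_command
{
public:
  // Returns the report text for "info" and an empty string otherwise.
  std::string run (const std::string &action,
                   const std::string &kind = "flat")
  {
    assert (kind == "graph" || kind == "flat");
    if (action == "on")
      {
        if (kind == "graph")
          current.reset (new profile_callgraph ());
        else
          current.reset (new profiler_flat ());
        current->start_profiling ();
        return "";
      }

    if (current && (action == "off" || action == "info"))
      {
        current->stop_profiling ();
        if (action == "info")
          return current->print_profile ();
      }
    return "";
  }

  bool profiling () const { return current && current->is_active (); }

private:
  std::unique_ptr<profile_base> current;
};
```

With this shape, the use case above corresponds to run("on", "graph"), followed by the profiled script, followed by run("info").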



III. Flat-File Profiler

Flat-file profilers are very simple, and only count
the average statistics of the program. The only attribute
information saved by Flat-profilers is kept in the
structure call_elem, which has the fields

1) long int ncalls;

2) double total_time;

3) double self_time;

4) std::string key;

for each function that is invoked in the program. The
meaning of each field is self explanatory, and the terms
are defined in the introduction. Nowhere are absolute
times measured; the profiler relies on the relative time
separation of events. The timing information is stored in
class objects of time_elem:

1) double delta; //incremental time from the previous
routine that a function is called

2) double tick; //time for running our kids/child
functions

The relative times noted down at each event are
finally added up to obtain the statistics for the
Flat-profiler output.
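For illustration, the two records and the derived statistics could be written in C++ as below. The field names follow the text. The helper functions ms_per_call and percent_time are hypothetical, and the sketch assumes the convention used by the algorithm in this section: self time is total time minus the time spent in callees, and percentage time is computed from self time.

```cpp
#include <cassert>
#include <string>

// Per-function record, with the fields listed in the text.
struct call_elem
{
  long int ncalls = 0;
  double total_time = 0.0;  // seconds, including time spent in callees
  double self_time = 0.0;   // seconds, excluding time spent in callees
  std::string key;          // function name, used as the hashtable key
};

// Per-activation timing record kept on the profiler's stack.
struct time_elem
{
  double delta = 0.0;  // relative time recorded for this activation
  double tick = 0.0;   // time accumulated by child function calls
};

// Average total time per call in milliseconds (the ms/call column).
inline double ms_per_call (const call_elem &rec)
{
  if (rec.ncalls == 0)
    return 0.0;
  return 1000.0 * rec.total_time / rec.ncalls;
}

// Share of the program runtime spent in the function's own code, in percent.
// Using self time here is an assumption of this sketch.
inline double percent_time (const call_elem &rec, double program_runtime)
{
  assert (program_runtime > 0.0);
  return 100.0 * rec.self_time / program_runtime;
}
```

For example, a function called 4 times with a total time of 2 s and a self time of 1 s, in a program that runs for 4 s, has 500 ms/call and a percentage time of 25%.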

A. Implementation 

The class profiler_flat implements the Flat-profiler, by
extending profile_base. The algorithm is implemented using
a stack data structure. The essential algorithm is the
same for the more complex Call-Graph profiler too.

The profiler dispatch function delivers the function
invoke and return events by interfacing with the profiler
API described in Sec II. The profiler function for the
Flat-profiler is given as

static void profile_func(const octave_function *fcn,
profiler_function_type ftype, profiler_call_state cstate);

This function further delivers the events to the particular
call-processing routines that handle the call and return
events separately. The function profile_func() is modeled as
a template pattern that delegates the events to particular
handlers.

B. Algorithm 

The algorithm for the Flat-profiler is summarized as:

1) When the profiler is started, note the starting time.

2) Register the profile event handler.

3) On Call Event:

Push a time_elem instance set to zero onto the
time-stack. This time-stack is a false call-stack, as it
mirrors the interpreter call-stack as functions are
invoked and return.

4) On Return Event:

a) Check if the hashtable has a record for the given
function. Otherwise create a new call_elem instance
for this function and set the name to the function.

b) Increase the number of calls on this record by 1.

c) Compute the relative time difference between
the call and return events, using the time_elem
object on the top of the time-stack.

d) Add the total time to the call record's
corresponding field.

e) Add the self time to the call record's
corresponding field. Compute the self time by
subtracting the value of tick from the total time.

f) Update the record in the hashtable, indexed by
the function name as key.

g) If the time-stack is not empty, add the cost of
this call to the tick field of the parent's time
element.

5) Repeat steps 3-4 until stopped.

6) Clear the profiling handler, and receive no more
events.

7) Once profiling is stopped, prepare to print the output.
Compute the % times.

8) Sort the hashtable entries according to the total-time
field of the record.

9) Print out in descending order of total times.

It is important to note that this algorithm is taken from
the profilers for popular programming languages [3], [4].
We attribute the idea to the Python and Ruby
implementations.
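The steps of the Flat-profiler algorithm can be sketched in C++ as follows. This is a minimal illustration, not the paper's implementation: the class name flat_profiler_sketch is hypothetical, timestamps are passed in explicitly so the example is deterministic, the entry timestamp is stored in delta instead of pushing a zeroed element, and std::map stands in for the hashtable.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct call_elem
{
  long int ncalls = 0;
  double total_time = 0.0;
  double self_time = 0.0;
  std::string key;
};

struct time_elem
{
  double delta = 0.0;  // here: timestamp of the call event (assumption)
  double tick = 0.0;   // time spent in child calls
};

class flat_profiler_sketch
{
public:
  // Step 3: push a timing element when a function is entered.
  void on_call (double now)
  {
    time_elem t;
    t.delta = now;
    time_stack.push_back (t);
  }

  // Step 4: fold the finished activation into the function's record.
  void on_return (const std::string &name, double now)
  {
    assert (! time_stack.empty ());
    time_elem top = time_stack.back ();
    time_stack.pop_back ();

    call_elem &rec = table[name];          // 4a: find or create the record
    rec.key = name;
    rec.ncalls += 1;                       // 4b
    double total = now - top.delta;        // 4c
    rec.total_time += total;               // 4d
    rec.self_time += total - top.tick;     // 4e
    if (! time_stack.empty ())             // 4g: charge the parent
      time_stack.back ().tick += total;
  }

  // Steps 8-9: records in descending order of total time.
  std::vector<call_elem> report () const
  {
    std::vector<call_elem> rows;
    for (const auto &kv : table)
      rows.push_back (kv.second);
    std::sort (rows.begin (), rows.end (),
               [] (const call_elem &a, const call_elem &b)
               { return a.total_time > b.total_time; });
    return rows;
  }

private:
  std::vector<time_elem> time_stack;       // mirrors the interpreter stack
  std::map<std::string, call_elem> table;  // stands in for the hashtable
};
```

If a function main is entered at t = 0, calls f from t = 1 to t = 4, and returns at t = 10, the sketch reports a total time of 10 and a self time of 7 for main, and a total and self time of 3 for f.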

IV. Call-Graph Profiler 

The Call-Graph profiler builds the profiling output along
with the program execution, as a directed graph with arcs.
The arcs point from the caller to the callee, and convey the
time of execution of the callee function. Second order
statistics, and more than averages, can be obtained by
sifting through the profiling data, which makes it much more
valuable than Flat profiling.

There are particular cases where Flat profiling
information is not helpful. In general, the execution times
of numerical routines depend on the size of the input
argument, and the average total time reported for routines
that are not O(1) skews the profiled data. Call-Graph
profilers sidestep such problems by assigning second order
statistics, which include the self, average, and total
times, to each arc of a function call, so that the profiled
function's complexity can be clearly observed without
skewing the data. From the definition of a Call-Graph,
the parent-child relationships (caller-callee relations) are
also immediately available from the profiled information.

It is to be noted that in our implementation not every
parent-child relationship is saved, and the data is averaged
for each unique caller-callee pair, in order to
reduce the profiler output to a meaningful subset.

Data structures derived from call_elem and time_elem,
with extra variables to contain the caller-callee
relationship records, are used.

A. Implementation 

The Call-Graph profiler is implemented in the class
profile_callgraph which, as in the Flat-profiler, derives
from the profile_base class.

The Call-Graph profiler is itself, so to speak, an
incremental improvement over the Flat-profiler. Its profiler
event reporting, dispatch and logging mechanisms are
similar to those of the Flat-profiler, and are not
reiterated here.

1) Algorithm: Much of the algorithm of the Call-Graph
profiler is very similar to that of the Flat-profiler. The
differences are:

1) Function call-event: when the call-stack is empty, all
call events are added to the toplevel callee hashtable.
This saves the caller-callee information.

2) Function return-event: the returning function is
added as a callee to the top-of-stack (TOS) function.
Then the caller for this returning function is set in the
hash-table, and its timing record updated. Similarly, the
callee for the TOS function is set as the returning
function, and the caller records updated.

3) Printing: the data is printed out as a tree, after
sorting in descending order.

The printing of results follows a tree like pattern,
illustrating the Call-Graph nature of the program execution.
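The differences from the Flat-profiler can be illustrated with a short C++ sketch that keys the timing records by caller-callee pair. This is a simplification under stated assumptions rather than the paper's data structures: the names callgraph_profiler_sketch, arc_elem and frame are hypothetical, all arcs (including those from the top level) are recorded on the return event, and timestamps are supplied explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Statistics accumulated for one caller -> callee arc.
struct arc_elem
{
  long int ncalls = 0;
  double total_time = 0.0;
  double self_time = 0.0;
};

// One activation on the mirrored call stack.
struct frame
{
  std::string name;
  double start;
  double child_time;
};

class callgraph_profiler_sketch
{
public:
  void on_call (const std::string &name, double now)
  {
    frame f;
    f.name = name;
    f.start = now;
    f.child_time = 0.0;
    stack.push_back (f);
  }

  void on_return (double now)
  {
    assert (! stack.empty ());
    frame done = stack.back ();
    stack.pop_back ();
    double total = now - done.start;

    // The caller is the new top of stack, or the top level when empty.
    std::string caller
      = stack.empty () ? std::string ("<toplevel>") : stack.back ().name;

    // Data is merged per unique caller-callee pair.
    arc_elem &edge = arcs[std::make_pair (caller, done.name)];
    edge.ncalls += 1;
    edge.total_time += total;
    edge.self_time += total - done.child_time;

    if (! stack.empty ())
      stack.back ().child_time += total;
  }

  // Look up an arc; throws std::out_of_range if it was never recorded.
  const arc_elem &arc (const std::string &caller,
                       const std::string &callee) const
  {
    return arcs.at (std::make_pair (caller, callee));
  }

  std::size_t arc_count () const { return arcs.size (); }

private:
  std::vector<frame> stack;
  std::map<std::pair<std::string, std::string>, arc_elem> arcs;
};
```

Because records are kept per arc rather than per function, two callers that invoke the same routine with very different input sizes show up as separate entries instead of being averaged together.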

V. Results & Discussion 

To evaluate the Flat and Call-Graph profilers, a test
case comprising a communication system simulation
program was evaluated. The program and associated files
were about 1672 lines of Octave code, excluding comments.
This code set was chosen for its availability as much
as for its similar performance on the Flat-profiler and the
Call-Graph profiler, due to the constant input modulation
sizes used all over the simulation program.

The profiling is carried out at the toplevel program using
the sequence of calls to the profile function mentioned in
Sec II.

A. Flat profiler results 

The Flat profiler gives an average performance of
the functions across the runtime of the program. The
run times of the programs are reported in seconds, while
ms/call indicates milliseconds/call. It should be noted
that the measured results are more fine-grained than the
ones reported. Results are rounded off due to formatting
constraints.

Fig. 1. Call-Graph profiler output

From the results in Table I, we see that the most
CPU-intensive function is GF_add, which takes about 29% of
the program runtime. This information, along with the
self times and number of calls, can be used to arrive at
possible optimization candidates.

The Call-Graph profiler output is more involved, as
shown in Fig. 1. The important difference is that the call
times and count information are collected on a caller-callee
basis, and reported so. The Call-Graph profiler in this case
helps to identify functions that perform on an O(n^)
complexity basis.

The voluminous output of the Call-Graph profiler, reduced
to the first few lines for brevity, is presented in
Figure 1. The same benchmark program was run on
the Call-Graph profiler as well.

B. Profiling Overhead 

There is a significant performance hit due to profiling.
In our design, we explicitly compensate for the
profiler runtime, and this is not a problem. The reported
overhead times are those found after compensation, and are
for the Flat-profiler.

The reported overhead time can only be accounted for
using a free parameter computed before profiling in each
profiling session. This is called the bias value, as reported
in the Python profiler [3]. In our profiler design we do not
include such free parameters.

Such an observed overhead can only be attributed to
times that are not computable within the profiler. Our
hypothesis attributes this time to the interpreter's delay
in invoking the profiler for each function call and return
event. This agrees well with the observed O(n) dependence
of the overhead time on the number of function calls. In
Figure 2, a tight-loop function was profiled with a varying
number of calls, to obtain the overhead information
presented in the graph. The linear trend observed in the
overhead time seems to justify the apparent constant
per-call overhead of the interpreter, which cannot be
compensated without computing a constant bias factor.

Fig. 2. Overhead time vs Number of Calls (Flat Profiler)

From Figure 2, the bias factor would seem to be the
slope of the overhead time, which can be estimated to be
around 8.1706x10^°^ seconds/call.
This is however a free parameter, and dependent on
implementation details. It should be noted that most
profilers suffer from a performance hit due to the profiling
overhead.

On a more general note, from the benchmark tests we
observe the total overhead times to be less than 0.5%
of the total program runtime. Our design has consistently
reduced the overhead time relative to our initial
prototype by compensating for each measurable profiling
time.

The Call-Graph profiler performs with a larger overhead
compared to the Flat-profiler; we estimate a rough factor
of 2x increase in the overhead time for the Call-Graph
profiler. We think the explanation for this variation is
the memory creation and cleanup associated with the data
structures used to build the Call-Graph. Non-trivial
I/O times are also associated with the Call-Graph
procedures.

C. Limitations of the profiler

Certain features which are not implemented at present
are not limitations of the profiler. These include

1) resource profiling for opened files, network
connections, and database handles;

2) tracing of the arguments passed from the caller to the
callee function;

3) event filtering, and selective profiling.

The limitations of the profiler reported below include
features that cannot be added in the current design.

1) Memory profiling needs deeper access to the
interpreter than the present framework can allow.

2) Line stepping and watches on variables are not
possible, and are more appropriate for a debugger.

3) Non-local exits are not traced. This means uncaught
exceptions are not profiled, and would end in an
abort of execution. This is however classified as
a bug in the user's Octave script program.

We also note that complete integration of the profiler
into the codebase of the GNU Octave project requires a
different approach to creating the profiler API. Such a
profiler mechanism would work by walking the Abstract
Syntax Tree (AST), and passing function call and return
events to the profiler. From our experimental work, we see
this as a feasible way to bring the advantages of the
call-graph profiler to Octave in the future.

VI. Conclusion

In this paper we have demonstrated a Flat profiler
and a Call-Graph profiler for a matrix based programming
language like GNU Octave. We have reported the profiling
overhead and benchmarked the performance of both profilers.
Further, the limitations and possible extensions of this
design are enumerated.



TABLE I. Flat-profiler results 